Explainable automated pain recognition in cats

Manual tools for pain assessment from facial expressions have been suggested and validated for several animal species. However, facial expression analysis performed by humans is prone to subjectivity and bias, and in many cases also requires special expertise and training. This has led to an increasing body of work on automated pain recognition, which has been addressed for several species, including cats. Even for experts, cats are a notoriously challenging species for pain assessment. A previous study compared two approaches to automated ‘pain’/‘no pain’ classification from cat facial images: a deep learning approach, and an approach based on manually annotated geometric landmarks, reaching comparable accuracy results. However, the study included a very homogeneous dataset of cats and thus further research to study generalizability of pain recognition to more realistic settings is required. This study addresses the question of whether AI models can classify ‘pain’/‘no pain’ in cats in a more realistic (multi-breed, multi-sex) setting using a more heterogeneous and thus potentially ‘noisy’ dataset of 84 client-owned cats. Cats were a convenience sample presented to the Department of Small Animal Medicine and Surgery of the University of Veterinary Medicine Hannover and included individuals of different breeds, ages, sex, and with varying medical conditions/medical histories. Cats were scored by veterinary experts using the Glasgow composite measure pain scale in combination with the well-documented and comprehensive clinical history of those patients; the scoring was then used for training AI models using two different approaches. We show that in this context the landmark-based approach performs better, reaching accuracy above 77% in pain detection as opposed to only above 65% reached by the deep learning approach. 
Furthermore, we investigated the explainability of such machine recognition in terms of identifying facial features that are important for the machine, revealing that the region of nose and mouth seems more important for machine pain classification, while the region of ears is less important, with these findings being consistent across the models and techniques studied here.

According to the International Association for the Study of Pain (IASP), pain is an "unpleasant sensory and emotional experience associated with, or resembling that associated with, actual or potential tissue damage" 1 . It is particularly important to recognize that "verbal description is only one of several behaviors to express pain; inability to communicate does not negate the possibility that a human or a nonhuman animal experiences pain". However, in the absence of verbal indications from patients, the accurate assessment of an individual's pain relies upon the inferences made by clinicians. Given the lack of standardised and objectively applicable tools to assess pain in such contexts 2 , this process is inherently challenging and a ubiquitous problem regarding non-human animals due to their non-verbal status 3 . Surveys in the veterinary profession clearly indicate that the lack of such tools may well interfere with an accurate assessment and classification and thus appropriate treatment of pain. For instance, a study of attitudes and beliefs of Queensland veterinarians in relation to postoperative pain and preoperative analgesia in dogs revealed that nearly one-fifth of respondents doubted their confidence in their knowledge about post surgical pain; 42% acknowledged difficulties recognising pain, and nearly one-quarter were unsure or negative about the capacity of veterinarians to recognise pain 4 . These findings were also supported in a study investigating the attitudes of veterinary practitioners in New Zealand to pain and analgesia in cats and

1. To what extent can a machine recognize pain in cats in a more naturalistic or 'noisy' population (e.g. variations in breed, sex and painful conditions)? We address this question by repeating and expanding the scope of the comparative study outlined in 27 using two approaches to the automation of cat pain recognition (landmark-based and deep learning based) on a new dataset of 84 client-owned cats presented to the Department of Small Animal Medicine and Surgery of the University of Veterinary Medicine Hannover. Different breeds with varying age, sex, and medical history were included; the cats were also scored by veterinarians using the Glasgow composite measure pain scale (CMPS), a previously validated behaviour-based tool, to provide an indication of the degree to which pain was present.
2. Which facial features are most important for the machine in relation to pain recognition performance? We address this question by using explainable AI (XAI 34 ) methods to investigate the roles played by different cat face regions (ears, eyes, mouth, and nose) in machine pain recognition.

Results
For narrative purposes, we preface our results with a high-level overview of the approaches used and a description of the dataset, covering essential and practical aspects for readers less familiar with AI methods.
Overview. Figure 1 presents a high-level overview of the two pipelines for the deep learning (DL) and landmark-based (LDM) approaches used in this study. Both of the pipelines start with cat facial alignment, using the method described in Feighelstein et al. 27 , which is based on manual landmark annotations. The aligned images are then fed to the deep learning models as is, while the landmark-based approach uses the XY locations of the 48 landmarks, which serve as cat face "abstractions". These landmarks are then used to create multi-vectors according to cat facial regions capturing ears, nose/mouth, eyes, as described in Feighelstein et al. 27 . These vectors form the final input to the machine learning models (Multilayer Perceptron and Random Forest are used here). Cats were recorded in a cage, where they were free to move (and hide themselves), having also free access to water and food during the whole hospitalisation period, as well as to a litter box inside their cage. The cats were captured using a mobile phone video recorder using a self-developed app, from which the best frames (recording distance approximately 10 cm with cat facing camera) were extracted. Example images are presented on Figure 2. Any presented cat was in principle eligible for the study. Cats of different breeds, ages, sex, and medical history were included. Brachycephalic cats, who have an extreme facial conformity (compared to mesocephalic cats), as well as cats with facial wounds or patients with neurological diseases that affect the facial expression were excluded.   www.nature.com/scientificreports/ The cats were scored during clinical examination using the CMPS-feline instrument 23 in their cage at least half an hour after the last clinical examination, in order to enable a rest period and to reduce scoring bias. The CMPS-feline instrument includes seven categories, referring to changes in the cat's behavior as well as in the cat's face. 
A total maximum of 20 points is possible, with scores ≥ 5 considered an intervention threshold 23 . In this study, the images were divided into two classes henceforth referred to as 'pain' and 'no pain' . Cats with CMPS scores of 4 were excluded to allow a clearer distinction between 'pain' and 'no pain' classes. Moreover, cats with CMPS scores of ≥ 5 which had no clinical reason to suspect pain were also excluded. This led to Class 1 ('pain') including 42 cats satisfying the following two conditions: (i) with CMPS scores of ≥ 5, and (ii) with clinical reasons to suspect pain. The clinical reasons for suspected pain of the cats in Class 1 are listed in Table 2. The most frequent reasons for presentation were various bone fractures (e.g. of the femur, pelvis or humerus), followed by gastrointestinal foreign bodies and surgery and problems concerning the urinary tract. Class 2 (i.e. 'no pain') was balanced with 42 cats using random undersampling and included cats who satisfied the following two conditions: (i) CMPS scores of <4, and (ii) with no known clinical reason to suspect pain. Only one sample frame of an individual was included in each of the two classes (see Fig. 4).
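The class assignment rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the `seed` parameter and the treatment of cats meeting neither pair of conditions as excluded are assumptions made here for clarity.

```python
import random
from typing import List, Optional


def assign_class(cmps_score: int, clinical_reason_for_pain: bool) -> Optional[str]:
    """Return 'pain', 'no pain', or None (excluded) for one cat.

    Hypothetical helper: thresholds follow the text (>= 5 with a clinical
    reason -> 'pain'; < 4 with no clinical reason -> 'no pain').
    """
    if cmps_score >= 5 and clinical_reason_for_pain:
        return "pain"
    if cmps_score < 4 and not clinical_reason_for_pain:
        return "no pain"
    # Score of exactly 4, or score and clinical picture disagree:
    # treated as excluded in this sketch.
    return None


def undersample(cats: List[str], n: int, seed: int = 0) -> List[str]:
    """Randomly keep n cats from the larger class (seed is an assumption)."""
    return random.Random(seed).sample(cats, n)
```

For example, `assign_class(4, False)` returns `None`, reflecting the exclusion of borderline scores, and `undersample(no_pain_cats, 42)` would balance the 'no pain' class against the 42 'pain' cats.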
Tables 1 and 2 list the participants in the two classes, with demographic information including sex, neuter status, breed, age, and the clinical condition that was the reason for presentation at the clinic.
For the LDM approach, the images were manually annotated with 48 landmarks, following the approach in Finka et al. and Feighelstein et al. 26,27 , which were specifically chosen for their relationship with underlying musculature, and relevance to cat-specific facial Action Units (catFACS 35 ). For the specific location of each landmark, see Fig. 3.

Model performance.
To measure model performance, we use the standard evaluation metrics of accuracy, precision, and recall (see, e.g., Lencioni et al. 36 for further details). As a validation method 37 , we use 10-fold cross validation with no subject overlap. This method is recommended 38 whenever the dataset contains no more than one sample of each individual. Table 3 compares the performance of different models (two types for each approach), with and without alignment and augmentation, which are data pre-processing techniques that can potentially improve performance. The landmark-based approach performs better, with the Random Forest (RF) model reaching accuracy above 77% in pain/no pain classification as opposed to only above 65% reached by the ResNet model.
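As an illustration of the evaluation protocol, the following minimal sketch computes accuracy, precision and recall with 'pain' as the positive class, and partitions subjects into 10 folds so that no cat appears in more than one fold. The function names, the choice of positive class and the `seed` parameter are assumptions of this sketch, not details taken from the study.

```python
import random
from typing import Hashable, Iterable, List, Sequence, Tuple


def classification_metrics(
    y_true: Sequence[str], y_pred: Sequence[str], positive: str = "pain"
) -> Tuple[float, float, float]:
    """Return (accuracy, precision, recall) for a binary labelling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


def subject_folds(
    subject_ids: Iterable[Hashable], k: int = 10, seed: int = 0
) -> List[list]:
    """Split unique subjects into k disjoint folds (no subject overlap)."""
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    # Round-robin assignment keeps fold sizes within one subject of each other.
    return [subjects[i::k] for i in range(k)]
```

With one image per cat, as in this dataset, splitting by subject and splitting by image coincide; the subject-level split only starts to matter once several frames per individual are included.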

Facial parts importance.
Explainable AI methods can be roughly divided into two types 39,40 , as illustrated in Fig. 4: data-focused and model-focused.
Data-focused explainability. In this approach, the idea is to occlude information on different facial regions from the model, exploring the impact of different regional occlusions on model classification accuracy. In the context of the importance of cat facial parts, we define the following general notions for occlusion configurations for a particular face region R:
• 'Full information': the model is trained and tested using information from all regions;
• 'Reveal only R': the model is trained and tested on information from only one specific region;
• 'Hide R': the model is trained and tested on information from all regions, excluding one specific region.
Figure 5 demonstrates the occlusion configurations for each of the three regions (ears, eyes, mouth). It should be noted, however, that the relationships between the accuracies in the two configurations are not linear: good performance in a model exposed only to ears does not necessarily imply low performance when it is exposed to eyes and mouth only. It should also be noted that if we extend these configurations from a single region to sets of regions, then there is a direct link between the two configurations: e.g., the 'hide' configuration for eyes is equivalent to the 'reveal only' configuration for ears and mouth. Finally, in the LDM approach the input to the model is derived from manually annotated landmark information based on XY coordinates, while in the DL approach it is raw image pixel-based information. Thus the "occlusion" processes applied to these two different models are performed on different units of information.
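The three occlusion configurations can be illustrated on landmark data with a short sketch. The region-to-landmark assignment below is purely hypothetical (six toy landmarks instead of 48), raw coordinates stand in for the region vectors, and the function and configuration names are chosen here for illustration only.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical assignment of landmark indices to face regions.
REGIONS: Dict[str, List[int]] = {"ears": [0, 1], "eyes": [2, 3], "mouth": [4, 5]}


def select_features(
    landmarks: Sequence[Tuple[float, float]],
    config: str,
    region: Optional[str] = None,
) -> List[float]:
    """Flatten the (x, y) landmarks kept under one occlusion configuration."""
    if config == "full":
        keep = [i for idx in REGIONS.values() for i in idx]
    elif config == "reveal_only":
        keep = list(REGIONS[region])
    elif config == "hide":
        keep = [i for name, idx in REGIONS.items() if name != region for i in idx]
    else:
        raise ValueError(f"unknown configuration: {config}")
    return [coord for i in sorted(keep) for coord in landmarks[i]]
```

In this toy setting, `select_features(pts, "hide", "eyes")` returns the same features as concatenating the 'reveal only' outputs for ears and mouth, which mirrors the equivalence between the two configurations when they are extended to sets of regions.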
The LDM approach. The units of information here are vectors (ordered pairs of (x,y) coordinates of the landmarks) in different facial regions, and 'occlusion' is achieved by excluding vectors belonging to a certain facial region. Tables 4 and 5 present the classification results using different occlusion configurations for the Random Forest and MLP classifiers, respectively. The two classifiers agree that hiding the ears gives very good accuracy (roughly the same as "all"), while using only the ears gives low accuracy. They also agree that using only the mouth gives good accuracy, while hiding the mouth reduces accuracy (compared to "all").
The DL approach. The units of information here are raw pixels, and 'occlusion' is achieved by hiding different combinations of face mask regions (ears, eyes and mouth). As the dataset is aligned, with all eye centers located at the same image position, we identify the eye mask area for all images as the area between the minimal and maximal y coordinate of any eye landmark. The ear mask region starts at the upper border of the image and ends at the top of the eye region. The mouth mask region starts at the bottom of the eye band and ends at the bottom of the image. We decided to use general masks for all the images instead of tailoring different regional masks per image according to their landmarks, in order to prevent the deep learning model from obtaining any information from the particular location of the tailored masks. Table 6 presents the classification results using different occlusion configurations. As in the LDM approach, in DL hiding the ears still gives good accuracy (relative to "all"), while using only the ears gives lower accuracy. Moreover, using only the mouth also gives good accuracy (relative to "all"), while hiding the mouth reduces accuracy (compared to "all").
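The construction of the three horizontal mask bands can be sketched as follows. This is a minimal illustration under stated assumptions: images are represented as plain lists of pixel rows, hidden pixels are overwritten with a constant `fill` value (the text does not specify how pixels are hidden), and the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Band = Tuple[int, int]  # (first row, one past the last row)


def region_bands(eye_landmark_ys: Iterable[float], image_height: int) -> Dict[str, Band]:
    """Derive ear, eye and mouth row bands from the eye landmarks of all images."""
    ys = list(eye_landmark_ys)
    eye_top = int(min(ys))
    eye_bottom = int(max(ys)) + 1
    return {
        "ears": (0, eye_top),                # image top down to the eye band
        "eyes": (eye_top, eye_bottom),       # min..max y of any eye landmark
        "mouth": (eye_bottom, image_height)  # below the eye band to image bottom
    }


def hide_regions(
    image: Sequence[Sequence[int]],
    bands: Dict[str, Band],
    hidden: Iterable[str],
    fill: int = 0,
) -> List[List[int]]:
    """Return a copy of the image with the rows of the hidden regions overwritten."""
    out = [list(row) for row in image]
    for name in hidden:
        top, bottom = bands[name]
        for r in range(top, bottom):
            out[r] = [fill] * len(out[r])
    return out
```

Because the same bands are applied to every aligned image, the position of a mask carries no image-specific information, which is the motivation given above for preferring general masks over per-image masks.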
Model-focused explainability. These methods are based on extracting information from the model itself, e.g. information on feature relevance such as using back-propagation algorithms in neural networks, or feature importance rating in tree-based models.
The LDM approach. In this approach the use of Random Forest models allows for extracting information on feature importance 41 for each of the landmarks. More specifically, we utilize the Gini Importance or Mean Decrease in Impurity (MDI) metric 42 that calculates each feature importance as the sum over the number of splits (across all trees) that include the feature, proportionate to the number of samples it splits 43 . Once the model is trained, we calculate the individual landmark importance as the sum of the feature importance of its input coordinates x and y. Figure 6 presents the feature importance of all the 48 landmarks (aggregated over all images), with red colors indicating more important landmarks and the deepness of the red color reflecting relative importance; the majority of the most important landmarks appear in the mouth area.
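The aggregation from coordinate-level to landmark-level importance can be sketched as below. The sketch assumes that the 96 input features are ordered as (x0, y0, x1, y1, ...), which is not stated in the text, and the function name is hypothetical. In scikit-learn, for instance, the per-feature MDI values would be available as the `feature_importances_` attribute of a fitted `RandomForestClassifier`; the study does not name the library used.

```python
from typing import List, Sequence


def landmark_importance(feature_importances: Sequence[float]) -> List[float]:
    """Sum the importance of each landmark's x and y coordinate.

    Assumes features are interleaved as (x0, y0, x1, y1, ...).
    """
    if len(feature_importances) % 2 != 0:
        raise ValueError("expected an even number of features (x and y per landmark)")
    return [
        feature_importances[i] + feature_importances[i + 1]
        for i in range(0, len(feature_importances), 2)
    ]
```

Since MDI importances sum to one over all features, the landmark-level values obtained this way also sum to one, so they can be read directly as relative importance.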
The DL approach. In the DL approach, we employ one of the most commonly used methods, GradCAM 44,45 , to visualize heatmaps showing the 'attention' areas of the trained ResNet50 network. The availability of landmark annotations from the LDM approach also allows for a more sophisticated quantitative analysis of the heatmaps, quantifying the degree of attention (heat) of the model per landmark (Fig. 7) and per face region (Fig. 8). This shows that the mouth and eyes are clearly more "informative" for the classifier than the ears. Table 9 presents a summary of indications consistent across both LDM and DL approaches, showing that the mouth is the most important, and the ears are the least important, facial part for the classifiers.
Figs. 9 and 10 present examples of GradCAM heatmaps extracted from images within our dataset. The hotter (deeper red) the pixel appears to be in a heatmap, the more attention is given to it by the model for pain/no pain classification. The colder (more blue) pixels are those receiving less attention from the model.

Discussion
Feighelstein et al. 27 showed that the LDM and DL approaches performed comparably well on a single-breed, single-sex, single-condition dataset, with both models reaching accuracy above 72%. The current study provides further indication for the success of the LDM approach, reaching an improved performance rate of above 77% on a more heterogeneous data population. The DL approach, on the other hand, is less successful on this more diverse dataset, reaching only around 65% accuracy. This drop in performance of the DL approach is however most likely due to the current dataset being much smaller than that of Feighelstein et al. 27 (464 images in the previous study as opposed to 84 here), given that deep learning approaches tend to be data-hungry. Thus investigating whether the performance of the DL approach is improved by enlarging the dataset is an immediate priority for future research. Landmark-based approaches are by their nature better able to directly measure and thus better account for variability in morphology of the cat faces (as opposed to DL approaches which use raw pixel data and may be "confused" by this variability), which could explain their robustness on this dataset.
Another important difference between the study of Feighelstein et al. 27 and this study is the ground truth labelling of pain/no pain classes. Broomé et al. 38 review labelling methods in the context of automated recognition of animal affect and pain, dividing them into two main types: behavior-based or stimulus-based state annotations. The former are purely based on the observed behaviors, and are usually scored by human experts. For the latter, the ground truth is based on whether the data were recorded during an ongoing stimulus or not. In Feighelstein et al. 27 , the time points when the images were taken provide a stimulus-based method, as the participants' images were captured after ovariohysterectomy at different time points corresponding to varying (controlled) intensities of pain (i.e. pre or post op and pre and post rescue analgesia). In the current study however, images of cats' faces were recorded in a real-life veterinary context where pain was naturally occurring rather than clinically induced/controlled, and 'pain/no pain' labelling was derived from a subsequently conducted behavior-based assessment method (the CMPS-feline 23 ), based on real-time human inferences of cat behavioural elements and facial changes.
On the more technical side, in the current study augmentation did not significantly improve model performance, which is in line with the findings in Feighelstein et al. 27 . Using Random Forest as a base model improved performance as compared to using MLP in the LDM approach. The use of multi-region vectorization led to improved performance in the LDM approach. The vectors were defined based on the cat face regions as defined by the FGS 15 , and thus they seem to "guide" the model in "looking" within each region separately, without linking anatomically unrelated landmarks. In this way the vectors can be efficient in holistically capturing the outputs of subtle differences in the relative positioning of underlying facial musculature that may occur as a consequence of the micromovements of the muscle contractions in cats' faces. Vector based approaches thus provide a more efficient geometric morphometric representation of the cat face for pain recognition than just using the set of landmarks with no connections between them.
[Figure caption (partial): … 26 . Landmarks appear contralateral to their origin, as they would when directly observing the cat's face.]
To summarize, our first findings suggest that in relation to pain/no pain discrimination accuracy, the annotation approach using landmarks is potentially more robust for use on noisier, more naturalistic populations and where resulting datasets are of a modest size. However, the downside of taking this route is the resource and effort needed for landmark annotation, given that this currently has to be completed manually. Thus one natural direction for future research is the automation of detection and annotation of cat facial landmarks. While a vast body of work addresses this problem for human faces (see Wu and Ji 46 for a review), the topic of landmark localization for animal faces is currently understudied. Development of such methods for cats will provide an essential step toward accurate automated cat pain recognition in clinical and other practical settings and may pave the way for subsequent cross-species application.
A further important finding of this study is summarized in Table 9, showing a striking consistency across approaches (LDM vs. DL, RF vs. MLP) with respect to occlusion experiments: using only information on the ears leads to low performance, while using only information on the mouth still delivers high performance. Moreover, hiding ears improves performance, while hiding the mouth decreases performance. This is further strengthened by the feature importance information extracted both in LDM and DL approaches (Table 7): features related to the ears appear to be the least important, while features related to the mouth appear to be the most important in both cases.
While a possible interpretation of this finding might be that the cat's mouth is more expressive than other facial regions, in Evangelista et al. 15 the cat's ears were reported as a more reliable visual indicator during human FGS scoring compared to the eyes (i.e., the ears had better internal consistency). Thus, an alternative and more probable explanation could be that the mixed-breed dataset used in the current study introduced greater baseline noise concerning the general shape and size of ears (i.e., Finka et al. 29 ) than could be handled by the machine learning approaches in order to use these features to reliably classify images based on pain presence/absence. However, low performance of the ears could also be attributed to other features associated with the specific dataset used in this study such as the way images were collected (i.e. the angle of the camera relative to the cat, or lighting conditions etc). The potential impact of such factors should be investigated in future studies. Another point worth noting is that this finding could also be related to the static (image-based) analysis performed in this study; in future investigations it should be checked whether it is also preserved in video-based approaches. One immediate research priority is therefore to investigate whether it is indeed the case that the machine "sees" pain differently to humans. One way to proceed would be to compare machine classification to human expert performance using methods such as face masking, similar to the idea used in the works 47,48,49 .
A limitation of the current study is the size of the dataset, as well as the predominance of male cats in it (two thirds). Another limitation is the use of photos, which capture just one momentary facial expression. As already mentioned above, the use of video data in the development of AI models can enable the analysis of both facial expressions and behavioral indicators of pain by taking into account the temporal dimension. As such approaches tend to be significantly data-hungry 38 , expanding the available datasets on cat pain from images to videos should be a priority for the development of AI models suitable for clinical settings.
The results presented in this study further support the indication from Feighelstein et al. 27 that AI-assisted recognition of negative affective states such as pain from cat faces is feasible. However, negative affective states can also be associated with other distressful conditions (e.g., anaemia, nausea). In order to differentiate pain from these conditions, further data acquisition with appropriate diagnostics is necessary. For this reason, the correlation of sampled footage with the corresponding clinical records is essential for the development of clinically supportive and multifaceted tools to differentiate painful and non-painful conditions causing a negative affective state. Due to the lack of verbal communication in animals, further development and optimization of these tools can be an important contribution to the adequate treatment of pain in cats. For this purpose, further data are necessary in order to guarantee appropriate generalizability of automated pain recognition especially among different cat breeds, medical conditions, technical possibilities, and environments. However, AI systems should be seen as a complement to and not a replacement of clinical judgement skills, with the potential to increase awareness of cases requiring greater attention and care.

Reliability of annotation.
To establish the reliability of the landmark annotation process, a second person manually annotated more than 10% of images from the dataset, using the same annotation instructions. Images used for reliability analysis were selected pseudo-randomly, so that contributions were balanced across individuals and conditions. At the point of annotation, both annotators were blinded to the condition from which each image was drawn. Inter-annotator reliability for the 96 XY coordinates was determined via the Intraclass Correlation Coefficient ICC2 (a measure of absolute agreement between raters 50 ), and reached the threshold for ICC2 acceptability.
Model training.
• DL. As per Feighelstein et al. 27 , we applied transfer learning to a ResNet50 model pre-trained on ImageNet, adding a new sub-network on top of the last layer with the parameters specified in that study 27 .
• LDM. As per Feighelstein et al. 27 , we trained a Multilayer Perceptron (MLP) neural network with an input layer of 96 neurons (one for each x and y coordinate of the 48 landmarks) and the parameters specified in that study 27 . Additionally, because it supports feature importance extraction 41 , we also trained a Random Forest model, optimizing accuracy while varying the maximal depth of trees (MaxDepth) between 1 and 40 and the number of estimators (Trees) from 1 to 250 in intervals of 5. The optimal MaxDepth and Trees parameters for each input configuration are specified in Table 3.
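The Random Forest parameter search described above amounts to an exhaustive grid search, which can be sketched as follows. Several points are assumptions of this sketch: the tree counts are taken to be 1, 6, ..., 246 (the text could equally be read as 5, 10, ..., 250), the interval of 5 is applied to the number of trees only, and `evaluate` is a hypothetical callback standing in for training a model and returning its cross-validated accuracy.

```python
from itertools import product
from typing import Callable, Optional, Tuple

MAX_DEPTHS = range(1, 41)     # MaxDepth: 1..40
N_TREES = range(1, 251, 5)    # Trees: 1, 6, ..., 246 (one reading of the text)


def grid_search(
    evaluate: Callable[[int, int], float]
) -> Tuple[float, int, int]:
    """Return (best accuracy, MaxDepth, Trees) over the full grid.

    `evaluate(max_depth, n_trees)` is a placeholder for fitting a Random
    Forest with these parameters and measuring its accuracy.
    """
    best: Optional[Tuple[float, int, int]] = None
    for depth, trees in product(MAX_DEPTHS, N_TREES):
        accuracy = evaluate(depth, trees)
        if best is None or accuracy > best[0]:
            best = (accuracy, depth, trees)
    return best
```

Under these assumptions the grid contains 40 × 50 = 2000 parameter combinations per input configuration, each of which would be scored with the cross-validation scheme described in the Results.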

Average heat calculation.
To calculate the average heat per face region (see Table 8 and Fig. 8

Data availability
The dataset is available from the corresponding authors upon request.